An Assessment of Character-Based Chinese News Filtering Using Latent Semantic Indexing

Authors

  • Shih-Hung Wu
  • Pey-Ching Yang
  • Von-Wun Soo
Abstract

We assess the Latent Semantic Indexing (LSI) approach to Chinese information filtering, specifically for Chinese news filtering agents that use a character-based, hierarchical filtering scheme. The traditional vector space model serves as the information filtering model, and each document is converted into a vector of term weights. Instead of using words as terms, as is conventional in IR, we use Chinese characters as terms. LSI captures the semantic relationship between documents and Chinese characters. We use the Singular Value Decomposition (SVD) technique to compress the term space into a lower-dimensional space that captures latent associations between documents and terms. Experimental results show that the recall and precision rates of Chinese news filtering using the character-based approach combined with the LSI technique are satisfactory.
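The pipeline outlined in the abstract (character-level term vectors, SVD compression, matching in the reduced space) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy documents, the raw-count term weights, the choice of k = 2, the standard LSI fold-in formula, and all function names (`char_doc_matrix`, `fit_lsi`, `fold_in`, `cosine`) are assumptions made for illustration, and the hierarchical filtering scheme is not modeled.

```python
import numpy as np

def char_doc_matrix(docs):
    """Build a character-by-document count matrix (each distinct character is a term)."""
    vocab = sorted({ch for d in docs for ch in d if not ch.isspace()})
    index = {ch: i for i, ch in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for ch in d:
            if ch in index:
                A[index[ch], j] += 1.0  # raw counts as weights (an assumption)
    return A, index

def fit_lsi(A, k):
    """Truncated SVD: keep the k largest singular triplets of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def fold_in(text, index, Uk, sk):
    """Project new text into the latent space with the usual fold-in: S_k^-1 U_k^T q."""
    q = np.zeros(len(index))
    for ch in text:
        if ch in index:  # characters unseen during fitting are ignored
            q[index[ch]] += 1.0
    return (Uk.T @ q) / sk

def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical toy corpus: two finance headlines, two sports headlines.
docs = [
    "股市上涨投资者乐观",
    "股市下跌投资者担忧",
    "球队赢得比赛冠军",
    "球队输掉比赛",
]
A, index = char_doc_matrix(docs)
Uk, sk, Vtk = fit_lsi(A, k=2)

# A hypothetical user profile expressed as a short character string.
profile_vec = fold_in("投资股市", index, Uk, sk)

# Column j of Vtk is document j in the latent space.
scores = [cosine(profile_vec, Vtk[:, j]) for j in range(len(docs))]
for doc, score in zip(docs, scores):
    print(f"{score:+.3f}  {doc}")
```

In a filtering agent, documents whose score exceeds some threshold would be passed to the user; the threshold and the number of retained dimensions are tuning choices that this sketch leaves open.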

Similar resources

Spam Filtering Based on Latent Semantic Indexing

In this paper, a study on the classification performance of a vector space model (VSM) and of latent semantic indexing (LSI) applied to the task of spam filtering is summarized. Based on a feature set used in the extremely widespread, de-facto standard spam filtering system SpamAssassin, a vector space model and latent semantic indexing are applied for classifying e-mail messages as spam or not...

Speech Indexing Using Semantic Context Inference

This study presents a novel approach to spoken document retrieval based on semantic context inference for speech indexing. Each recognized term in a spoken document is mapped onto a semantic inference vector containing a bag of semantic terms through a semantic relation matrix. The semantic context inference vector is then constructed by summing up all the semantic inference vectors. Such a sem...

Indexing Audio Documents by using Latent Semantic Analysis and SOM

This paper describes an important application for state-of-the-art automatic speech recognition, natural language processing and information retrieval systems. Methods for enhancing the indexing of spoken documents by using latent semantic analysis and self-organizing maps are presented, motivated and tested. The idea is to extract extra information from the structure of the document collection an...

Japanese-Chinese Cross-Language Information Retrieval: An Interlingua Approach

Electronically available multilingual information can be divided into two major categories: (1) alphabetic language information (English-like alphabetic languages) and (2) ideographic language information (Chinese-like ideographic languages). The information available in non-English alphabetic languages as well as in ideographic languages (especially, in Japanese and Chinese) is growing at an i...

Presentation of an efficient automatic short answer grading model based on combination of pseudo relevance feedback and semantic relatedness measures

Automatic short answer grading (ASAG) is the automated process of assessing answers based on natural language using computation methods and machine learning algorithms. Development of large-scale smart education systems on one hand and the importance of assessment as a key factor in the learning process and its confronted challenges, on the other hand, have significantly increased the need for ...

Journal title:
  • IJCLCLP

Volume 3  Issue 

Pages  -

Publication date 1997